Efficient High-precision Boilerplate Detection Using Multilayer Perceptrons
Abstract
Removal of boilerplate is among the essential tasks in web corpus construction and web indexing. In this paper, we present an improved machine learning approach to general-purpose boilerplate detection for languages written in (extended) Latin alphabets; the approach is easily adaptable to other scripts. We keep it highly efficient (around 320 documents per second on a single CPU core) by using an optimized Multilayer Perceptron implementation, while achieving around 95% correct classifications (Precision, Recall, and F1 score over 0.95) by extracting suitable text block-internal features. Finally, we compare the performance of the Multilayer Perceptron to that of other classifiers such as Support Vector Machines.
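The abstract reports Precision, Recall, and F1 score for block-level classification. As a minimal sketch of how these three figures are conventionally computed, assuming each text block carries a boolean gold label and a boolean prediction (True meaning "boilerplate"), the following may help. The function name `precision_recall_f1` and the toy labels are illustrative assumptions, not taken from the paper.

```python
def precision_recall_f1(gold, predicted):
    """Return (precision, recall, f1) for boolean block labels.

    gold, predicted: equal-length sequences of booleans, where True means
    the text block is boilerplate (an assumed labeling convention).
    """
    if len(gold) != len(predicted):
        raise ValueError("gold and predicted must have the same length")

    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g and p)        # boilerplate, found
    fp = sum(1 for g, p in pairs if not g and p)    # content, wrongly flagged
    fn = sum(1 for g, p in pairs if g and not p)    # boilerplate, missed

    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    if precision + recall:
        f1 = 2 * precision * recall / (precision + recall)
    else:
        f1 = 0.0
    return precision, recall, f1


# Toy example with five text blocks (hypothetical labels).
gold = [True, True, True, False, False]
predicted = [True, True, False, True, False]
precision, recall, f1 = precision_recall_f1(gold, predicted)
```

F1 is the harmonic mean of Precision and Recall, so a reported F1 above 0.95 indicates that neither of the two is substantially lower than that.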
Similar resources
Accurate and efficient general-purpose boilerplate detection for crawled web corpora
Removal of boilerplate is one of the essential tasks in web corpus construction and web indexing. Boilerplate (redundant and automatically inserted material like menus, copyright notices, navigational elements, etc.) is usually considered to be linguistically unattractive for inclusion in a web corpus. Also, search engines should not index such material because it can lead to spurious results f...
Efficient estimation of multidimensional regression model using multilayer perceptrons
This work concerns the estimation of multidimensional nonlinear regression models using multilayer perceptrons (MLPs). The main problem with such models is that we need to know the covariance matrix of the noise to get an optimal estimator. However, we show in this paper that if we choose as the cost function the logarithm of the determinant of the empirical error covariance matrix, then we get...
Are Rosenblatt multilayer perceptrons more powerfull than sigmoidal multilayer perceptrons? From a counter example to a general result
In the eighties, the lack of an efficient algorithm to train multilayer Rosenblatt perceptrons was addressed by sigmoidal neural networks and backpropagation. But should we still try to find an efficient algorithm to train multilayer hardlimit neuronal networks, a task known to be an NP-complete problem? In this work we show that this would not be a waste of time by means of a counter ex...
Detection of Text with Connected Component Clustering
Text detection and recognition is a hot topic for researchers in the field of image processing. It gives attention to Content based Image Retrieval (CBIR) community in order to fill the semantic gap between low level and high level features. Several methods have been developed for text detection and extraction that achieve reasonable accuracy for natural scene text (camera images) as well as mu...
Automatic Microfossil Detection in Somosaguas Sur paleontologic site (Pozuelo de Alarcón, Madrid, Spain) using Multilayer Perceptrons
Microvertebrate fossils are used in biochronology to determine the age of geological layers with a high degree of accuracy, and in paleoecology to extract information about the past environment. Current techniques for extracting microfossils are manual and require a large amount of time and human resources. This makes the study of other, more complex techniques worthwhile. The work prese...